Exponential families are the workhorses of parametric modelling theory. One reason for their 
popularity is their associated inference theory, which is very clean, both from a theoretical and a 
computational point of view. One way in which this set of tools can be enriched in a natural and 
interpretable way is through mixing. This paper develops and applies the idea of local mixture 
modelling to exponential families. It shows that the highly interpretable and flexible models 
which result have enough structure to retain the attractive inferential properties of exponential 
families. In particular, results on identification, parameter orthogonality and log-concavity of 
the likelihood are proved. 

Keywords: affine geometry; convex geometry; differential geometry; dispersion model; 
exponential families; mixture model; statistical manifold 

1. Introduction 

The theory of local mixture models is motivated by a number of different statistical modelling 
situations which share a common structure. These situations include overdispersion 
in binomial and Poisson regression models, frailty analysis in lifetime data analysis 
(Anaya-Izquierdo and Marriott [3]) and measurement errors in covariates in regression 
models (Marriott [12]). Other applications include local influence analysis (Critchley and 
Marriott [6]) and the analysis of predictive distributions (Marriott [11]). 

Univariate exponential models defined on $\mathbb{R}$ are, in terms of the natural parameter, of 
the form $f(x;\theta) = \exp\{\theta x - k_\nu(\theta)\}\,\nu(x)$ with respect to some $\sigma$-finite measure $\nu$ on $\mathbb{R}$. An 
alternate and important parametrization is the expected parameter $\mu$, where the transformation 
from the natural parameter is defined by $\mu = E_{f(x;\theta)}(X)$. Throughout, regularity 
conditions on parametric families similar to those in Amari [1], and stated in 
Anaya-Izquierdo [2], are assumed. Let $\mathcal{F} = \{f(x;\mu) : \mu \in M\}$ be a given regular parametric family. 
Now, let $x_1,\ldots,x_n$ be a random sample from the distribution $g(x;Q) = \int f(x;\mu)\,dQ(\mu)$ 
for some unknown proper distribution function $Q$. This paper then focuses on making 
inferences about $g(x;Q)$ given $x_1,\ldots,x_n$. 

The common structure which is shared in the applications above is that a relatively 
standard model has been fitted, say a member of the exponential family, and this model 
explains most of the variation in the data. However, more detailed analysis shows that 
there is still some unexplained variation which the analyst would like to deal with. A 
common way of modelling this unexplained variation is through mixing. The mixing is 
local in that it is, in some sense, 'small' and not the dominant source of variation in 
the problem. This paper looks in detail at what notions of 'small' mixing might mean 
and how it can be dealt with. In order to keep the presentation focused, a running 
example concentrates on overdispersion and mixing in the binomial example, but the 
theory is much more general, covering local mixing over any exponential family fulfilling 
the regularity conditions. 

The main focus of the paper involves looking at the way that different assumptions on 
the mixing mechanism can be unified using the structure of the local mixture model. The 
local mixture structure (Definition 2) provides a way of reducing an infinite-dimensional 
problem to a finite-dimensional one such that the loss involved can be characterized 
(Theorem 6). The resulting computations in local mixture models are straightforward 
since they exploit log-concavity properties of the likelihood function (Theorem 4), as well 
as identification (Theorem 2) and orthogonality between interest and nuisance parameters 
(Theorem 2). The structure of the local mixture also naturally indicates which points in 
a data set are inferentially highly influential (Section 2.1). 

2. Local mixture models 

This paper follows a geometric approach and works by embedding simple exponential 
families in an infinite-dimensional space which is general enough to contain all models 
which can be constructed by mixing. The following definitions define both the embedding 
space and a local mixture model of a regular exponential family. 

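Definition 1. Consider the affine space defined by $(X_{\mathrm{Mix}}, V_{\mathrm{Mix}}, +)$. In this construction, 
the set $X_{\mathrm{Mix}}$ is defined as 

$$X_{\mathrm{Mix}} = \left\{ f \in L^2(\nu) \;\middle|\; \int_S f(x)\,\nu(dx) = 1 \right\},$$

a subset of the square-integrable functions from the fixed support set $S$ to $\mathbb{R}$, and $\nu$ is 
a measure defined to have support on $S$. On this set, an affine geometry is imposed by 
defining the vector space $V_{\mathrm{Mix}}$, 

$$V_{\mathrm{Mix}} = \left\{ f \in L^2(\nu) \;\middle|\; \int_S f(x)\,\nu(dx) = 0 \right\}.$$

Finally, the addition operator is the usual addition of functions. 
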
Definition 2. The local mixture model of a regular exponential family $f(x;\mu)$ is defined 
via its mean parameterization as 

$$g(x;\mu,\lambda) := f(x;\mu) + \sum_{k=2}^{r} \lambda_k f^{(k)}(x;\mu),$$

where $f^{(k)}(x;\mu) = \frac{\partial^k}{\partial\mu^k} f(x;\mu)$. Here, $r$ is called the order of the local mixture model. 

The hard boundary of the local mixture model is defined as the subset of the parameter 
space where 

$$g(x;\mu,\lambda) = 0$$

for some $x$ in the support of $f$. 

In order to see the link between Definitions 1 and 2, note that $f(x;\mu) \in X_{\mathrm{Mix}}$ and, 
furthermore, by the regularity conditions, all $\mu$-derivatives of $f(x;\mu)$ are elements of $V_{\mathrm{Mix}}$. 
Also, note that elements of $X_{\mathrm{Mix}}$ are not restricted to be non-negative. Rather, the space 
of regular density functions, $\mathcal{F}$, is a convex subset of the affine space $(X_{\mathrm{Mix}}, V_{\mathrm{Mix}}, +)$. It 
follows that restricting the family $g(x;\mu,\lambda)$ to $\mathcal{F}$ induces a boundary in the parameter 
space. 

Example 1. The local mixture model of order 4 for the binomial family has a probability 
mass function of the form 

$$g(x;\mu,\lambda_2,\lambda_3,\lambda_4) = \frac{n!\,\mu^{x}(n-\mu)^{n-x}}{x!\,(n-x)!\,n^{n}}\,\{1 + \lambda_2 p_2(x,\mu) + \lambda_3 p_3(x,\mu) + \lambda_4 p_4(x,\mu)\}, \tag{1}$$

where the $p_i$ are polynomials in $x$. 
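
As a minimal sketch of Definition 2 for the binomial family, the following Python code builds the local mixture mass function by differentiating the binomial probability with respect to the mean, treating it as a polynomial in the mean. The helper names (`binomial_poly`, `local_mixture_pmf`, `inside_hard_boundary`) and the parameter values are hypothetical and chosen only for illustration; exact rational arithmetic is used so that the moment identities can be checked without rounding error.

```python
from fractions import Fraction
from math import comb


def binomial_poly(n, x):
    """Coefficients in mu (lowest power first) of
    f(x; mu) = C(n, x) mu^x (n - mu)^(n - x) / n^n."""
    m = n - x
    coeffs = [Fraction(0)] * (n + 1)
    for j in range(m + 1):
        # Binomial theorem: (n - mu)^m = sum_j C(m, j) n^(m - j) (-mu)^j
        term = comb(n, x) * comb(m, j) * n ** (m - j) * (-1) ** j
        coeffs[x + j] += Fraction(term, n ** n)
    return coeffs


def poly_derivative(coeffs):
    """Derivative of a polynomial given by its coefficient list."""
    return [i * c for i, c in enumerate(coeffs)][1:] or [Fraction(0)]


def poly_eval(coeffs, mu):
    return sum(c * mu ** i for i, c in enumerate(coeffs))


def local_mixture_pmf(n, mu, lams):
    """g(x; mu, lambda) for x = 0..n, with lams = {k: lambda_k}, k >= 2."""
    pmf = []
    for x in range(n + 1):
        poly = binomial_poly(n, x)
        value = poly_eval(poly, mu)
        for k in range(1, max(lams, default=1) + 1):
            poly = poly_derivative(poly)  # poly now holds the k-th mu-derivative
            if k in lams:
                value += lams[k] * poly_eval(poly, mu)
        pmf.append(value)
    return pmf


def inside_hard_boundary(pmf):
    """True when every probability mass is strictly positive."""
    return all(p > 0 for p in pmf)


# Illustrative (assumed) values: size 10, mean 3, small lambda parameters.
n, mu = 10, Fraction(3)
lams = {2: Fraction(1, 10), 3: Fraction(-1, 100), 4: Fraction(1, 1000)}
g = local_mixture_pmf(n, mu, lams)
total = sum(g)
mean = sum(x * p for x, p in enumerate(g))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(g))
print(total, mean, variance, inside_hard_boundary(g))
```

Because the derivative terms sum to zero over the sample space and have zero first moment, the total mass and the mean are unchanged by the $\lambda$ terms, while the variance moves away from the binomial variance only through $\lambda_2$.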

The following example shows a way in which local mixtures can be qualitatively different 
from mixtures and motivates the definition of a true local mixture (Definition 3). 

Example 2. Consider the following example of local mixing over the normal family 
$\phi(x;\mu,1)$, with known variance of 1. The local mixture model of order 4 is 

$$\phi(x;\mu,1)\{1 + \lambda_2(-1 + x^2 - 2x\mu + \mu^2) + \lambda_3(-3x + x^3 - 3x^2\mu + 3x\mu^2 + 3\mu - \mu^3)$$
$$\qquad + \lambda_4(3 - 6x^2 + 12x\mu - 6\mu^2 + x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4)\}.$$

It is easy to show that the variance of $g(x;\mu,\lambda_2,\lambda_3,\lambda_4)$ is $1 + 2\lambda_2$. Consider the model 
$g(x;\mu,-0.01,0,0.003)$. This is a true density since the parameter values satisfy the positivity 
condition 

$$g(x;\mu,\lambda) > 0 \quad \forall x \in S;$$

however, its variance is less than 1. So the local mixture model has parameter values 
which result in a reduced variance when compared to the unmixed model $\phi(x;\mu,1)$. This 
runs counter to the well-known result that if mixed and unmixed models have the same 
mean, then the variance should be increased by mixing; see, for example, Shaked [17]. 
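
The claims in Example 2 can be checked numerically. The sketch below, which assumes a plain trapezoid rule on a wide finite interval and an arbitrary illustrative mean of 0.5, evaluates the mass, mean and variance of $g(x;\mu,-0.01,0,0.003)$ and scans the polynomial factor for its minimum. The function names are hypothetical.

```python
import math


def bracket(z, lam2, lam3, lam4):
    """Polynomial factor of the order-4 normal local mixture, with z = x - mu."""
    he2 = z * z - 1.0
    he3 = z ** 3 - 3.0 * z
    he4 = z ** 4 - 6.0 * z * z + 3.0
    return 1.0 + lam2 * he2 + lam3 * he3 + lam4 * he4


def density(x, mu, lam2, lam3, lam4):
    z = x - mu
    normal = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return normal * bracket(z, lam2, lam3, lam4)


def trapezoid(fn, lo, hi, steps=20000):
    """Composite trapezoid rule; accurate here because the tails are negligible."""
    h = (hi - lo) / steps
    total = 0.5 * (fn(lo) + fn(hi))
    for i in range(1, steps):
        total += fn(lo + i * h)
    return total * h


def summary(mu, lam2, lam3, lam4, half_width=12.0):
    lo, hi = mu - half_width, mu + half_width

    def g(x):
        return density(x, mu, lam2, lam3, lam4)

    mass = trapezoid(g, lo, hi)
    mean = trapezoid(lambda x: x * g(x), lo, hi)
    var = trapezoid(lambda x: (x - mean) ** 2 * g(x), lo, hi)
    # Scan the polynomial factor on a fine grid of z values.
    n_grid = int(2 * half_width / 0.001) + 1
    min_bracket = min(
        bracket(-half_width + 0.001 * i, lam2, lam3, lam4) for i in range(n_grid)
    )
    return mass, mean, var, min_bracket


mass, mean, var, min_bracket = summary(0.5, -0.01, 0.0, 0.003)
print(mass, mean, var, min_bracket)
```

Under these assumptions, the computed variance should agree with $1 + 2\lambda_2$ up to quadrature error, and a strictly positive minimum of the polynomial factor on the grid is consistent with the positivity condition, since the quartic term dominates outside the scanned range when $\lambda_4 > 0$.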

This example shows that the class of densities which are local mixture models is too 
rich to use for studying inference on all mixtures. It might be tempting to restrict the 
class to lie in the convex hull of the full exponential family inside the infinite-dimensional 
affine space $(X_{\mathrm{Mix}}, V_{\mathrm{Mix}}, +)$ when this space is given enough topological structure for the 
Krein-Milman theorem to hold (Phelps [16]). It is surprising to note that there exist 
examples where the local mixture model does not lie in this infinite-dimensional convex 
hull unless $\lambda = 0$; such examples include mixtures of the exponential distribution 
(Anaya-Izquierdo and Marriott [3]). 

The following definition of a true local mixture ensures that a finite number of natural 
moment-based inequalities for mixtures also hold for local mixtures. It also allows the 
parameters of the true local mixture model to have a natural interpretation in terms of 
possible mixing distributions. 

Definition 3. An order $r$ local mixture model $g(x;\mu,\lambda)$ of the regular exponential family 
$f(x;\mu)$ is called true if and only if there exists a distribution $Q_{\mu,\lambda}$ and corresponding 
exact mixture 

$$\int f(x;m)\,dQ_{\mu,\lambda}(m)$$

such that the first $r$ moments of both distributions agree. 

True local mixtures can be characterized in terms of convex hulls in finite-dimensional 
affine spaces in the following way. Let $X^{r}_{\mathrm{Mix}}$ denote the convex subset of $X_{\mathrm{Mix}}$ where the 
first $r$ moments exist, then define the $r$-moment mapping from $X^{r}_{\mathrm{Mix}}$ to an $r$-dimensional 
vector space via 

$$\mathcal{M}_r(f) = (E_f(X), E_f(X^2), \ldots, E_f(X^r)).$$

Theorem 1. Let $f(x;\mu)$ be a regular exponential family, $M$ a compact subset of the 
mean parameter space and let the order $r$ local mixture of $f(x;\mu)$ be $g(x;\mu,\lambda)$. 

(i) If, for each $\mu$, the moments $\mathcal{M}_r(g(x;\mu,\lambda))$ lie in the convex hull of 

$$\{\mathcal{M}_r(f(x;\mu)) \mid \mu \in M\} \subset \mathbb{R}^r,$$

then $g(x;\mu,\lambda)$ is a true local mixture model. 

(ii) If $g(x;\mu,\lambda)$ has the same $r$-moment structure as $\int f(x;m)\,dQ_{\mu,\lambda}(m)$, where $Q_{\mu,\lambda}$ 
has support in $M$, then the moments $\mathcal{M}_r(g(x;\mu,\lambda))$ lie in the convex hull of 

$$\{\mathcal{M}_r(f(x;\mu)) \mid \mu \in M\} \subset \mathbb{R}^r.$$

Proof. See Appendix. □ 

2.1. Statistical properties 

This section shows that (true) local mixture models have extremely nice statistical properties. 
In particular, they are identified, have nice parameter orthogonality properties 
and the log-likelihood function is very well behaved. The following definition will be used 
throughout. 

Definition 4. If $f(x;\mu)$ is a natural exponential family in the mean parametrization, 
then $V_f(\mu)$, defined by 

$$V_f(\mu) := E_{f(x;\mu)}[(X-\mu)^2],$$

is called the variance function of the natural exponential family. 

If the variance function Vf(fi) is quadratic, then the corresponding exponential families 
have very attractive statistical properties (Morris [15]). Examples include the normal, 
Poisson, gamma, binomial and negative binomial families, which form the backbone of 
parametric statistical modelling. One example of the special properties is given by the 
following result. 

Theorem 2. Let $f(x;\mu)$ be a regular natural exponential family and $\mu$ the mean 
parametrization. The local mixture model $g(x;\mu,\lambda)$ is then identified in all its parameters. 

Furthermore, if the variance function $V_f(\mu)$ is a polynomial of degree at most 2, then 
the $(\mu,\lambda)$ parametrization is orthogonal at $\lambda = 0$. 

Proof. See Appendix. □ 

The following result shows that when working in the mean parametrization, the A 
parameters have a direct interpretation in terms of the mixing distribution. It also shows 
the reason for dropping the first derivative in the local mixture expansion, which explains 
the difference between the definition given here and that in Marriott [11]. 



Theorem 3. Let $g(x;\mu,\lambda)$ be a true local mixture for the regular exponential family 
$f(x;\mu)$. 





(ii) If it is further assumed that $f(x;\mu)$ has a quadratic variance function $V(\mu)$ such 
that $2 + V^{(2)}(\mu) > 0$, then $\lambda_2 \ge 0$. 
Figure 1. The log-likelihood function on a fiber of the local mixture of a binomial model. The 
hard boundary is shown by the dashed lines (A) while the singularity in the log-likelihood occurs 
at the dotted lines (B). Just one point in the sample has changed between the two plots. 



Proof. Part (i) follows from the properties of conditional expectations, while (ii) follows 
by direct calculation. □ 

From Morris [14], Table 1, the condition on the variance function in Theorem 3 holds for 
the normal, Poisson, gamma, negative binomial and binomial (for size > 1) families. Note, 
however, that there are examples of exponential families where the variance function is 
non-quadratic, such as the inverse Gaussian; see Letac and Mora [10]. 
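
The condition $2 + V^{(2)}(\mu) > 0$ can be tabulated directly for quadratic variance functions. The sketch below stores each variance function by its coefficients $(a_0, a_1, a_2)$ in $V(\mu) = a_0 + a_1\mu + a_2\mu^2$, using the standard mean-parametrized forms; the specific shape and size values are assumptions made only for illustration.

```python
from fractions import Fraction

# Coefficients (a0, a1, a2) of V(mu) = a0 + a1*mu + a2*mu^2.
# Shape and size values below are illustrative assumptions.
VARIANCE_FUNCTIONS = {
    "normal (variance 1)": (1, 0, 0),
    "Poisson": (0, 1, 0),
    "gamma (shape 2)": (0, 0, Fraction(1, 2)),          # V = mu^2 / shape
    "binomial (size 10)": (0, 1, Fraction(-1, 10)),     # V = mu - mu^2 / size
    "binomial (size 1)": (0, 1, Fraction(-1, 1)),
    "negative binomial (size 3)": (0, 1, Fraction(1, 3)),  # V = mu + mu^2 / size
}


def condition_value(coeffs):
    """Return 2 + V''(mu); constant in mu because V is quadratic."""
    return 2 + 2 * coeffs[2]


def satisfies_condition(coeffs):
    return condition_value(coeffs) > 0


for name, coeffs in VARIANCE_FUNCTIONS.items():
    print(name, condition_value(coeffs), satisfies_condition(coeffs))
```

The binomial family with size 1 is the boundary case in this table, which matches the restriction to size greater than 1 stated above.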

Formally, local mixture models are examples of fiber bundles which, as Amari [1] has 
shown, can have very attractive statistical properties. In this paper, the model $g(x;\mu_0,\lambda)$ 
for a fixed $\mu_0$ is called the fiber at $\mu_0$. The following theorem shows that the log-concavity 
of the likelihood function, one of the most important properties of natural exponential 
families, is paralleled in the fibers of local mixture models. 

Theorem 4. The log-likelihood function for $\lambda$, for a fixed, known $\mu_0$, based on the density 
function $g(x;\mu_0,\lambda)$ and the random sample $x_1,\ldots,x_n$, is concave. 

Proof. See Appendix. □ 
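
The concavity in Theorem 4 can be illustrated numerically on a fiber of the order-4 normal local mixture of Example 2. The sketch below uses a small hypothetical data set and a fixed $\mu_0 = 0$, both assumptions for illustration only. It checks the midpoint inequality on a grid of $\lambda$ values and evaluates the second directional derivative, which is minus a sum of squares.

```python
import itertools
import math

DATA = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, 2.1]  # hypothetical sample
MU0 = 0.0                                      # assumed fixed, known mean


def hermite_vector(z):
    """Polynomials multiplying (lambda_2, lambda_3, lambda_4) in Example 2."""
    return (z * z - 1.0, z ** 3 - 3.0 * z, z ** 4 - 6.0 * z * z + 3.0)


def fiber_loglik(lam, data=DATA, mu0=MU0):
    """Log-likelihood in lambda = (lam2, lam3, lam4) on the fiber at mu0."""
    total = 0.0
    for x in data:
        z = x - mu0
        factor = 1.0 + sum(l * h for l, h in zip(lam, hermite_vector(z)))
        if factor <= 0.0:
            return -math.inf  # singularity: an observed point has g <= 0
        total += -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) + math.log(factor)
    return total


def second_directional_derivative(lam, direction, data=DATA, mu0=MU0):
    """d^2/dt^2 of fiber_loglik(lam + t*direction) at t = 0."""
    total = 0.0
    for x in data:
        h = hermite_vector(x - mu0)
        factor = 1.0 + sum(l * hk for l, hk in zip(lam, h))
        slope = sum(d * hk for d, hk in zip(direction, h))
        total -= (slope / factor) ** 2
    return total


def midpoint_gap(a, b):
    """l((a+b)/2) - (l(a) + l(b))/2, non-negative for a concave function."""
    mid = tuple((u + v) / 2.0 for u, v in zip(a, b))
    return fiber_loglik(mid) - 0.5 * (fiber_loglik(a) + fiber_loglik(b))


grid = list(itertools.product((-0.02, 0.0, 0.02), repeat=3))
worst_gap = min(midpoint_gap(a, b) for a, b in itertools.combinations(grid, 2))
print(worst_gap >= -1e-12)
print(second_directional_derivative((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

The check only requires positivity of the density at the observed points, so the region on which this log-likelihood is finite contains the part of the fiber inside the hard boundary.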

There is a clear parallel between the shape of the log-likelihood on a fiber and on an 
exponential family. One difference between these two cases is that in the fiber, there can 
be a singularity where the log-likelihood tends to negative infinity. This happens when a 
data point $x$ is observed in the sample such that $g(x;\mu,\lambda) = 0$. This can only happen on 
or outside the hard boundary. This property is illustrated in the following example. 

Example 1 (Revisited). The local mixture for a binomial model has a hard boundary 
in a fiber which is defined as the intersection of half-spaces in the parameter space. For 
example, the fiber of the local mixture model of order 3, $g(x;\mu,\lambda_2,\lambda_3)$, has a parameter 
space which is a subset of $\mathbb{R}^2$, as shown in Figure 1. The hard boundary is defined by 
the intersection of half-spaces of the form 

$$\{(\lambda_2,\lambda_3) \mid 1 + \lambda_2 q_2(x_i,\mu_0) + \lambda_3 q_3(x_i,\mu_0) > 0\},$$

where $x_i \in \{0,\ldots,n\}$ and $q_2$ and $q_3$ are polynomials in $x$. 

In Figure 1 the log-likelihood for this fiber is shown as a contour plot in a case where 
$n = 10$. In both panels, the hard boundary simplifies as 

$$\{(\lambda_2,\lambda_3) \mid 1 + \lambda_2 q_2(10,\mu_0) + \lambda_3 q_3(10,\mu_0) > 0\} \cap \{(\lambda_2,\lambda_3) \mid 1 + \lambda_2 q_2(0,\mu_0) + \lambda_3 q_3(0,\mu_0) > 0\}$$

and the hard boundary is shown as dashed lines. In general, for the binomial case, this 
hard boundary is determined by the extreme points of the sample space. 

The log-concavity of the likelihood is clear in both plots and the singularities in the 
log-likelihood can also be seen. In the left-hand panel, the sample size is 50 and singularities 
can be seen along the dotted lines defined by 

$$1 + \lambda_2 q_2(x_m,\mu_0) + \lambda_3 q_3(x_m,\mu_0) = 0,$$

where $x_m$ is the maximum (minimum) observed value in the data set which happens to 
be 8 (1). In the right-hand panel, the log-likelihood for the fiber is shown with the same 
data, except that one of the observations, which was 8, has been changed to 10. The 
singularity has jumped and now lies on the hard boundary. Thus it can be seen that, 
unlike exponential families, the log-likelihood in local mixtures can be very sensitive to 
a single data point and is especially sensitive to large or small observations. 

K. Anaya-Izquierdo and P. Marriott 

3. Asymptotic approximations 

As stated in the Introduction, the aim of this paper is to explore how the space of (true) 
local mixture models can be seen as an approximation to the space of all mixtures, with 
the added benefit of having the good statistical properties shown in the previous section. 
The use of truncated asymptotic expansions provides a direct link between exact and 
local mixture models. 

3.1. Laplace expansions 

Consider the mixture density defined by 

$$g(x;Q) = \int_{\Theta} f(x;\theta)\,dQ(\theta), \tag{2}$$

where $Q$ is a distribution over the parameter space $\Theta$. Note here that the choice of 
parameter is general and not restricted to $\mu$. If the mixing distribution is unknown, 
(2) appears to define an infinite-dimensional family over which inference would appear 
to be difficult. Local mixture models use modelling assumptions on the 'smallness' of 
the mixture to approximate the class of models given by (2) by a finite-dimensional 
parametric family, where the parameters decompose into the interest parameter $\theta$ and a 
small number of well-defined and interpretable nuisance parameters. When the mixing 
distribution is continuous, one sensible and useful interpretation of smallness is that the 
mixing distribution is close to a degenerate delta function, that is, it is close to the case 
of no mixing. In such an example, a Laplace expansion gives an asymptotic tool which 
enables us to construct the local mixing family; see, for example, Wong [19]. To formalize 
this, assume that the mixing distribution is a member of the following family. 

Definition 5. A model of the form 

$$q(\mu;m,\varepsilon) = a(\varepsilon)\,V^{-1/2}(\mu)\exp\left\{-\frac{1}{2\varepsilon}\,d(\mu;m)\right\} \tag{3}$$

is called a proper dispersion model if the unit deviance $d$ is a non-negative, twice 
continuously differentiable function satisfying $d(\mu;\mu) = 0$, $d(\mu;m) > 0$ for $\mu \ne m$ and 
$\frac{\partial^2 d}{\partial\mu^2}(\mu;\mu) > 0$ and, for $\mu, m$ in the parameter space, there exist suitable functions $a$ and $V(\mu)$, the unit 
variance function defined as $V(\mu) = 2\left(\frac{\partial^2 d}{\partial\mu^2}(\mu,\mu)\right)^{-1}$; for details, see Jørgensen [9]. 

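Theorem 5. Let $\mathcal{F} = \{f(x;\theta) : \theta \in \Theta\}$ be a regular family and also let 

$$\mathcal{Q} = \{dQ(\theta;\vartheta,\varepsilon) : \vartheta \in \Theta, \varepsilon > 0\}$$

be a family of proper dispersion models defined on $\Theta$. The $\mathcal{Q}$-mixture of $\mathcal{F}$ has the 
asymptotic expansion 

$$g(x;Q(\theta,\vartheta,\varepsilon)) = \frac{\int_{\Theta} f(x;\theta)\,V^{-1/2}(\theta)\exp(-d(\theta,\vartheta)/(2\varepsilon))\,d\theta}{\int_{\Theta} V^{-1/2}(\theta)\exp(-d(\theta,\vartheta)/(2\varepsilon))\,d\theta} \sim f(x;\vartheta) + \sum_{i=1}^{2r} A_i(\vartheta,\varepsilon)\,f^{(i)}(x;\vartheta) + O_{x,\vartheta}(\varepsilon^{r+1}) \tag{4}$$

as $\varepsilon \to 0$, for fixed $\vartheta \in \Theta$ and $x$ and for functions $A_i$ such that 

$$A_i(\vartheta,\varepsilon) = O_{\vartheta}(\varepsilon^{u(i)}),$$

$$\frac{E_Q[(\theta-\vartheta)^i]}{i!} \sim A_i(\vartheta,\varepsilon) + O_{\vartheta}(\varepsilon^{r+1}), \qquad i = 1, 2, \ldots, 2r,$$

where $u(i) = \lfloor (i+1)/2 \rfloor$. 

The following alternative expansion is also valid: 

$$g(x;Q(\theta,\vartheta,\varepsilon)) \sim f(x;M_1(\vartheta,\varepsilon)) + \sum_{i=2}^{2r} M_i(\vartheta,\varepsilon)\,f^{(i)}(x;M_1(\vartheta,\varepsilon)) + O_{x,\vartheta}(\varepsilon^{r+1}), \tag{5}$$

for functions $M_i$ such that 

$$E_Q[\theta] \sim M_1(\vartheta,\varepsilon) = \vartheta + A_1(\vartheta,\varepsilon) + O_{\vartheta}(\varepsilon^{3}),$$

$$M_i(\vartheta,\varepsilon) = O_{\vartheta}(\varepsilon^{u(i)}),$$

$$\frac{E_Q[(\theta - E_Q[\theta])^i]}{i!} \sim M_i(\vartheta,\varepsilon) + O_{\vartheta}(\varepsilon^{r+1}).$$

If the density $f(x;\theta)$ and all of its derivatives are bounded, then the statement will be 
uniform in $x$. 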



Expression (4) is an expansion around the mode of the mixing distribution, while (5) is 
an expansion around the mean. Note that this latter expansion is not actually centered at 
the exact mean, but at the function $M_1(\vartheta,\varepsilon)$, which is very close to the exact mean when 
$\varepsilon$ is small. It follows immediately that expansion (5) is of the form given in Definition 2 
and is therefore a local mixture, after truncating the remainder term. The form (4) can 
be thought of either as a direct Laplace expansion, as it was in Marriott [11], or as a 
simple reparametrization of (5). This fact shows the generality of Definition 2. 
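
The order of the truncation error can be illustrated in a case where the exact mixture is available in closed form. The sketch below assumes a normal location family with unit variance mixed over a normal distribution with mean $\vartheta$ and variance $\varepsilon$, which has the form (3) with $d(\theta;\vartheta) = (\theta-\vartheta)^2$ and $V \equiv 1$. In this assumed setting the odd coefficients vanish by symmetry, $A_2 = \varepsilon/2$ and $A_4 = \varepsilon^2/8$, and the exact mixture is a normal density with variance $1 + \varepsilon$. The function names are hypothetical.

```python
import math


def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def exact_mixture(x, theta0, eps):
    """N(theta0, eps) mixture of N(theta, 1) densities, i.e. N(theta0, 1 + eps)."""
    var = 1.0 + eps
    return math.exp(-0.5 * (x - theta0) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def local_mixture(x, theta0, eps, order):
    """Truncated expansion f + A_2 f'' (+ A_4 f'''') around theta0."""
    z = x - theta0
    factor = 1.0
    if order >= 2:
        factor += (eps / 2.0) * (z * z - 1.0)                       # A_2 f'' / f
    if order >= 4:
        factor += (eps ** 2 / 8.0) * (z ** 4 - 6.0 * z * z + 3.0)   # A_4 f'''' / f
    return phi(z) * factor


def sup_error(theta0, eps, order, half_width=6.0, steps=1200):
    """Largest absolute difference on an evenly spaced grid."""
    grid = (theta0 - half_width + 2.0 * half_width * i / steps for i in range(steps + 1))
    return max(
        abs(exact_mixture(x, theta0, eps) - local_mixture(x, theta0, eps, order))
        for x in grid
    )


# Halving eps should divide the error by roughly 4 (order 2) or 8 (order 4).
ratio_order2 = sup_error(0.0, 0.1, 2) / sup_error(0.0, 0.05, 2)
ratio_order4 = sup_error(0.0, 0.1, 4) / sup_error(0.0, 0.05, 4)
print(ratio_order2, ratio_order4)
```

The two ratios give a rough empirical check that the remainder after the $\varepsilon$ term behaves like $\varepsilon^2$ and the remainder after the $\varepsilon^2$ term behaves like $\varepsilon^3$, in line with the orders $u(i)$ in Theorem 5.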

Note that when the parameter space of $f(x;\mu)$ has boundaries, the class of possible 
mixing distributions must be adapted to take them into account. This issue is fully 
explored in Anaya-Izquierdo [2], which shows similar results to Theorem 5. 

Proof of Theorem 5. See Anaya-Izquierdo [2]. □ 

3.2. Discrete mixing 

To see the relationship between discrete mixture models and local mixtures, consider a 
family of discrete finite distributions which shrink around their common mean $\mu$, 

$$Q(\theta;\mu,\varepsilon) = \sum_{i=1}^{n} \rho_i\, I\{\theta < \theta_i(\varepsilon)\},$$

where $|\mu - \theta_i(\varepsilon)| = O(\varepsilon^{1/2})$, $\sum_{i=1}^{n} \rho_i\theta_i(\varepsilon) = \mu$, $\sum_{i=1}^{n} \rho_i = 1$, $\rho_i \ge 0$ and $I$ is the indicator 
function. The mixture over such a finite distribution has the form 

$$\sum_{i=1}^{n} \rho_i\, f(x;\theta_i(\varepsilon)). \tag{6}$$

This has the asymptotic expansion 

$$f(x;\mu) + \sum_{j=2}^{r} M_j(Q)\, f^{(j)}(x;\mu) + R(x,\mu,Q), \tag{7}$$

where 

$$M_j(Q) = \frac{1}{j!}\sum_{i=1}^{n} \rho_i\,(\theta_i(\varepsilon) - \mu)^j$$

and $R(x,\mu,Q) = O(\varepsilon^{(r+1)/2})$. There is a close parallel with the expansion (5) in Theorem 5. 

Definition 6. Following expansion (7), define the function $\Phi$ by the weighted moment 
map 

$$\Phi(Q) = (M_2(Q), \ldots, M_r(Q)).$$

It thus follows that 

$$\int f(x;m)\,dQ(m) - g(x;\mu,\Phi(Q)) = R(x,\mu,Q). \tag{8}$$

A comparison of expansion (7) with those in Theorem 5 reveals interesting differences. 
In expansion (5), the fiber is centered at the pseudo-mean $M_1$ and the order of the terms 
is $u(i) = \lfloor (i+1)/2 \rfloor$, while in (7), the asymptotic order is the 'more natural' $i/2$ and the 
expansion is around the exact mean. One reason for these differences is the requirement 
for a valid asymptotic expansion in Theorem 5 imposed by Watson's lemma (Wong [19]), 
that the (continuous) mixing distribution must have exponentially decreasing 
tails. There are, of course, no such restrictions for discrete mixtures, provided the number 
of components is known or bounded. 

Related to this difference is the idea of the smallness of the mixing. In Theorem 5, the 
idea of the mixture being close to the unmixed model was captured by the small variance 
of the mixing distribution. There is, however, a quite different notion in the discrete case. 
The simplest example of this is given by a two-component finite mixture 

$$\rho f(x;\theta_1) + (1-\rho) f(x;\theta_0) = f(x;\theta_0) + \rho\{f(x;\theta_1) - f(x;\theta_0)\}, \tag{9}$$

for any regular family $f(x;\theta)$. The form on the right-hand side shows the natural way 
that this mixture lies inside the affine space $(X_{\mathrm{Mix}}, V_{\mathrm{Mix}}, +)$ since $\int f(x;\theta_0)\,\nu(dx) = 1$ 
and $\int \{f(x;\theta_1) - f(x;\theta_0)\}\,\nu(dx) = 0$. The new interpretation of when (9) is 'close' to the 
model $f(x;\theta_0)$ is when $\rho$ is small, rather than when $\theta_1$ is close to $\theta_0$. 

The simple observation that there are mixtures which are arbitrarily close to an 
unmixed model $f(x;\theta_0)$, but which can have components $f(x;\theta_1)$ which are far from being 
local, shows that the interpretation of local mixture models in terms of Laplace 
expansions is not exhaustive. 

4. Marginal inference on $\mu$ 

Suppose that the local mixture model $g(x;\mu,\lambda)$ is to be used for marginal inference on $\mu$. 
Interpreting this in a Bayesian sense means that it is of interest to know if marginalizing 
over some subset of the parameter space for $\lambda$ is equivalent to marginalizing over a 
set of mixing distributions. The marginal posterior defined over some class of mixing 
distributions $\mathcal{Q}(\mu)$, each with mean $\mu$, has the form 

$$\int_{\mathcal{Q}(\mu)} \prod_{i=1}^{n}\left\{\int f(x_i;m)\,dQ(m)\right\} \times \pi(\mu,Q)\,dP(Q), \tag{10}$$

for some prior $\pi(\mu,Q)$ and where $dP(Q)$ is a measure over $\mathcal{Q}(\mu)$. On the other hand, the 
marginal distribution over the local mixture has the form 

$$\int_{\Lambda(\mu)} \prod_{i=1}^{n} g(x_i;\mu,\lambda) \times \pi(\mu,\lambda)\,dP(\lambda), \tag{11}$$

again for a prior $\pi(\mu,\lambda)$ and where $\Lambda(\mu)$ is the set of parameters corresponding to 
distributions in $\mathcal{Q}(\mu)$. 

In order to describe classes of mixing distributions, first consider the following, appar- 
ently restrictive, possibilities. 

Definition 7. For a regular exponential family $f(x;\mu)$, let $\{M(\mu)\}$ be a family of 
compact subsets of the mean parameter space such that $\mu \in M(\mu)$. Define $\mathcal{Q}_{M(\mu)}$ to be the set 
of distributions which have support on $M(\mu)$ and have expected value $\mu$. Furthermore, let 
$\mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$ be the subset of $\mathcal{Q}_{M(\mu)}$ defined by the finite mixtures. Since each $M(\mu)$ is compact, 
its length can be defined by $|M(\mu)| = \max M(\mu) - \min M(\mu)$. 

The following result shows that marginal inference for $\mu$ over a local mixture model is 
asymptotically equivalent to that over all distributions with compact support, provided 
that the parameter space is bounded away from possible singularities in the log-likelihood 
function. 

Theorem 6. Let $f(x;\mu)$ be a regular exponential family and $g(x;\mu,\lambda)$ the corresponding 
local mixture model of order $r$. Also, assume that the compact covering $\{M(\mu)\}$ satisfies 
$|M(\mu)| = O(\varepsilon^{1/2})$. For each $\mu$, let $\Lambda(M(\mu))$ be defined by 

$$\Lambda(M(\mu)) := \{\Phi(Q) \mid Q \in \mathcal{Q}^{\mathrm{dis}}_{M(\mu)}\}.$$

Suppose further that for all $Q \in \mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$, 

$$g(x_i;\mu,\Phi(Q)) > C > 0$$

for every observed data point $x_i$. 

Under these assumptions, there exists a prior $\pi(\mu,\lambda)$, depending on $\pi(\mu,Q)$, such that 
$R_2(\varepsilon)$ bounds 

$$\left| \int_{\mathcal{Q}^{\mathrm{dis}}_{M(\mu)}} \prod_{i=1}^{n}\left\{\int_{M(\mu)} f(x_i;m)\,dQ(m)\right\} \pi(\mu,Q)\,dP(Q) - \int_{\Lambda(M(\mu))} \prod_{i=1}^{n} g(x_i;\mu,\lambda)\,\pi(\mu,\lambda)\,dP(\lambda) \right|,$$

where $R_2(\varepsilon) = O(\varepsilon^{(r+1)/2})$. 

Proof. See Appendix. □ 

Theorem 6 has the following interpretation. If $\Lambda(M(\mu))$, the set of $\lambda$-values of interest, 
is bounded away from any of the possible singularities in the log-likelihood, then, for 
a sufficiently small compact cover $\{M(\mu)\}$, there is little loss in undertaking marginal 
inference on $\mu$ with the local mixture model, as compared to the set of all finite mixing 
distributions, $\mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$. By weak convergence, this result extends to the space $\mathcal{Q}_{M(\mu)}$, that 
is, all mixing distributions with support in the compact cover. 

Since many important mixing distributions do not have compact support, this result 
might still seem somewhat restrictive. Note, however, that as far as the contribution to 
the posterior is concerned, since 

$$\int f(x_i;m)\,dQ(m) = \int_{M(\mu)} f(x_i;m)\,dQ(m) + \int_{M(\mu)^{c}} f(x_i;m)\,dQ(m),$$

there can only be a small loss in extending to distributions with uniformly small 'tail 
probabilities'. 



5. Overdispersion in binomial models 

There is a large body of literature regarding the problem of overdispersion in binomial 
models. In this section, it is assumed that the object of inference is either to learn about 
$\mu = E(X)$ under an overdispersed binomial model or to find a good predictive distribution. 
Two approaches are of interest here, quasi-likelihood and direct modelling through, 
for example, the beta-binomial model. For the first of these, see Cox [5], McCullagh [13] 
and Firth [8] and references therein; for the second, see Crowder [7]. Both approaches add 
nuisance parameters in order to take account of the overdispersion. This section looks at 
the way that inference, and the number of nuisance parameters, depends on modelling 
assumptions about the form of mixing and the configuration of the data. 

The binomial model has the simplifying advantages that its parameter space for $\mu = n\pi$ 
is compact and that it has a quadratic variance function. In order to classify the types 
of mixing, let $\mathcal{Q}_C(\mu)$ be the set of finite distributions with support on a compact subset 
$C \subset [0,n]$ and mean $\mu$. Since any mixing distribution over binomial models is a weak limit 
of such distributions, it is clear that this is a sufficiently rich family for understanding 
all possible binomial mixtures. From Definition 6, a mapping from $\mathcal{Q}_{[\mu-\varepsilon,\mu+\varepsilon]}(\mu)$ to the 
set of true local mixtures is defined by 

$$\int f(x;m)\,dQ(m;\mu,\varepsilon) \to g(x;\mu,\Phi(Q)).$$



Local mixture models 

[Figure 2 panels: Second and Third; Second and Fourth; Third and Fourth] 



Figure 2. The extremal points for the convex hull which characterizes true local mixtures of 
order 4 whose mixing distributions have a fixed compact support. The true local mixtures are 
represented by the corresponding central moments of the mixing distribution. 



Furthermore, from Theorem 1, it follows that the set of possible true local mixtures which 
lie in the image of $\Phi$ forms a compact and convex set. 

Following Teuscher and Guiard [18], any distribution with mean $\mu$ is the weak limit of 
mixtures of discrete distributions with two support points of the form 

$$Q(m;\mu) = \rho\, I(m < \mu_1) + (1-\rho)\, I(m < \mu_2), \tag{12}$$

where $\rho\mu_1 + (1-\rho)\mu_2 = \mu$ and $0 < \rho < 1$. Such mixing distributions are the extremal 
points of the convex hull and provide a convenient way to characterize it. An example of 
such a set of points is illustrated in Figure 2. In this plot, for clarity, the central moments 

$$(E_Q((M-\mu)^2),\; E_Q((M-\mu)^3),\; E_Q((M-\mu)^4))$$

are plotted for mixtures of the form (12). For fixed $\mu$, the central moments are a linear 
transformation of the non-central ones, thus extremal points are preserved. 
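
The central moments of the two-point mixing distributions in (12) are simple to compute, which gives a way of tracing out points like those plotted in Figure 2. The sketch below is illustrative only: the function names, the grid of support points and the interval half-width are assumptions, not values taken from the figure.

```python
from fractions import Fraction


def two_point_central_moments(mu, mu1, mu2, orders=(2, 3, 4)):
    """Central moments of the two-point distribution on {mu1, mu2} with mean mu."""
    if not mu1 < mu < mu2:
        raise ValueError("need mu1 < mu < mu2 so that 0 < rho < 1")
    # The weight rho on mu1 solves rho*mu1 + (1 - rho)*mu2 = mu.
    rho = Fraction(mu2 - mu) / Fraction(mu2 - mu1)
    return tuple(rho * (mu1 - mu) ** k + (1 - rho) * (mu2 - mu) ** k for k in orders)


def extremal_points(mu, eps, steps):
    """Central moments for a grid of two-point distributions on [mu - eps, mu + eps]."""
    points = []
    for a in range(1, steps + 1):
        for b in range(1, steps + 1):
            mu1 = mu - Fraction(a, steps) * eps
            mu2 = mu + Fraction(b, steps) * eps
            points.append(two_point_central_moments(mu, mu1, mu2))
    return points


example = two_point_central_moments(5, 3, 6)
cloud = extremal_points(Fraction(5), Fraction(2), 4)
print(example, len(cloud))
```

Plotting pairs of coordinates of `cloud` against each other would give a rough analogue of the three panels, under the stated assumptions about the support interval.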

The following result shows that this characterization of mixing distributions also char- 
acterizes true local mixtures, hence the integration used in Theorem 6, or any likelihood 
maximization, is implicitly over such convex sets. The result also directly links the first 
r moments of the mixture distribution to the first r moments of the mixing distribution. 



Theorem 7. If $f(x;\mu)$ is a binomial model and $\Phi$ the projection defined in Definition 6, 
then $g(x;\mu,\lambda)$ is a true local mixture model if and only if $\lambda = \Phi(Q)$ for some $Q \in \ldots$ 

Proof. See Appendix. □ 
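
The link between the moments of the mixing distribution and those of the mixture can be checked exactly for a small case. The sketch below compares an exact two-point mixture of binomials with the order-4 local mixture whose parameters are $\lambda_j = E_Q[(m-\mu)^j]/j!$, which is the Taylor-coefficient reading of the weighted moment map assumed here. The helper names, the binomial size and the mixing distribution are all hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def binomial_poly(n, x):
    """Coefficients in mu of f(x; mu) = C(n, x) mu^x (n - mu)^(n - x) / n^n."""
    m = n - x
    coeffs = [Fraction(0)] * (n + 1)
    for j in range(m + 1):
        term = comb(n, x) * comb(m, j) * n ** (m - j) * (-1) ** j
        coeffs[x + j] += Fraction(term, n ** n)
    return coeffs


def poly_derivative(coeffs):
    return [i * c for i, c in enumerate(coeffs)][1:] or [Fraction(0)]


def poly_eval(coeffs, mu):
    return sum(c * mu ** i for i, c in enumerate(coeffs))


def phi_map(weights, support, mu, order):
    """Assumed form: lambda_j = sum_i rho_i (m_i - mu)^j / j!, j = 2..order."""
    return {
        j: sum(w * (m - mu) ** j for w, m in zip(weights, support)) / factorial(j)
        for j in range(2, order + 1)
    }


def local_mixture_pmf(n, mu, lams):
    pmf = []
    for x in range(n + 1):
        poly = binomial_poly(n, x)
        value = poly_eval(poly, mu)
        for k in range(1, max(lams, default=1) + 1):
            poly = poly_derivative(poly)  # k-th derivative with respect to mu
            if k in lams:
                value += lams[k] * poly_eval(poly, mu)
        pmf.append(value)
    return pmf


def exact_mixture_pmf(n, weights, support):
    return [
        sum(w * poly_eval(binomial_poly(n, x), m) for w, m in zip(weights, support))
        for x in range(n + 1)
    ]


def raw_moments(pmf, order):
    return [sum(Fraction(x) ** i * p for x, p in enumerate(pmf)) for i in range(1, order + 1)]


n, order = 10, 4
weights = (Fraction(1, 3), Fraction(2, 3))   # illustrative mixing weights
support = (Fraction(3), Fraction(6))         # illustrative support points
mu = sum(w * m for w, m in zip(weights, support))
lams = phi_map(weights, support, mu, order)
exact = exact_mixture_pmf(n, weights, support)
local = local_mixture_pmf(n, mu, lams)
moments_agree = raw_moments(exact, order) == raw_moments(local, order)
max_pointwise_gap = max(abs(a - b) for a, b in zip(exact, local))
print(moments_agree, float(max_pointwise_gap))
```

The two mass functions differ pointwise, since the remainder term is not zero, but their first four raw moments coincide in exact arithmetic.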



As $\varepsilon$, which defines a set of mixing distributions $\mathcal{Q}^{\mathrm{dis}}_{[\mu-\varepsilon,\mu+\varepsilon]}$, grows, the order, $r$, of the 
local mixture model needed to give a good uniform approximation to this set also grows. 
The required dimension can be measured by the variability of the posterior distribution 
over compact convex hulls and by the hard boundary. For small values of $\varepsilon$, the posterior 
distribution is essentially one-dimensional as the convex hull at each $\mu$ is very small. As 
$\varepsilon$ increases, the posterior becomes essentially two-dimensional. It is in this region that 
the overdispersion methods described above are most effective. These methods essentially 
add one nuisance parameter, which is enough to model the flexibility in the posterior, 
and hence give good marginal inferences. 

Figure 3 shows what happens when mixing distributions with wider support are con- 
sidered. In panel (a), a sample from two well-separated binomial components is shown. In 
order to model this local mixture, models of degree six were selected and the correspond- 
ing maximum likelihood estimate is shown in panel (b) (circles) together with the best 
fitting unmixed model (crosses). Methods which used only one extra nuisance parameter 
were inadequate here, while the local mixture seems to fit well, giving a good predictive 
model. To see the effect on marginal inference, the data set in (c) was generated. It shows 
considerable skewness. Firth [8] investigated the efficiency of the quasi-likelihood method 
and notes that it does not work well when there is a large amount of skewness in the data. 
Again, local mixture models of degree six were chosen and the marginal posterior was 
calculated by numerical integration over a convex set defined for each value of $\mu$. For this 
example, the conditions of Theorem 6 hold, thus marginal inference for the local mixture 
model is a good representative of that over all mixtures in $\mathcal{Q}^{\mathrm{dis}}_{[\mu-\varepsilon,\mu+\varepsilon]}$. This marginal log 
posterior (more correctly, integrated likelihood) is shown by the dashed line in (d) with 
the solid line showing the log-likelihood for the unmixed model. 



Appendix: Proofs 



Proof of Theorem 1. First, note that from the standard properties of exponential 
families, all moments of $f(x;\mu)$ exist. Furthermore, from the form of the derivatives of 
exponential families in Appendix D, it is immediate that all moments of the local mixture 
$g(x;\mu,\lambda)$ also exist. Hence, the local mixture model is mapped by $\mathcal{M}_r$ into $\mathbb{R}^r$. 

(i) This result follows from Carathéodory's theorem (see Barvinok [4], Theorem 2.3) 
since a point lies inside the convex hull of a set in an $r$-dimensional affine space if and only if it can 
be represented as a convex combination of at most $r + 1$ points of the set. Hence, for 
each $i = 1,\ldots,r$, there exists a discrete distribution $Q_{\mu,\lambda}$ such that 

$$\int x^i g(x;\mu,\lambda)\,dx = \int\left\{\int x^i f(x;m)\,dx\right\} dQ_{\mu,\lambda}(m) = \int x^i\left\{\int f(x;m)\,dQ_{\mu,\lambda}(m)\right\} dx.$$

Thus the $r$-moments of $g(x;\mu,\lambda)$ and $\int f(x;m)\,dQ_{\mu,\lambda}(m)$ agree and so $g(x;\mu,\lambda)$ is a true 
local mixture model. 

(ii) By assumption, when $i = 1,\ldots,r$, 

$$\int x^i g(x;\mu,\lambda)\,dx = \int x^i \int f(x;m)\,dQ_{\mu,\lambda}(m)\,dx.$$
Figure 3. Two examples of data sets generated from mixtures of binomial models. Panel (b) 
shows the fitted binomial and local mixture models based on the data in (a). In (d), the marginal 
log-posterior for $\mu$, for data shown in (c), calculated over mixtures with support $\mu \pm 5$, is 
compared to that for which there is no assumption of mixing. 



Since $Q_{\mu,\lambda}$ has support in $M$, it can be considered as the weak limit of a sequence $Q_n$ 
of discrete distributions with support in $M$. It is immediate that for each $n$, the point 
$\mathcal{M}_r(\int f(x;m)\,dQ_n(m))$ lies in the convex hull. Since $M$ is compact, the corresponding 
convex hull is compact and hence closed; see Barvinok [4], Corollary 2.4. Thus the limit 
$\mathcal{M}_r(\int f(x;m)\,dQ_{\mu,\lambda}(m))$ also lies in the convex hull. □ 
Proof of Theorem 2. By repeatedly differentiating the identity 

$$\int x f(x;\mu)\,dx = \mu$$

with respect to $\mu$, it is easy to see that 

$$\int x f^{(k)}(x;\mu)\,dx = 0$$

for $k \ge 2$. Hence, it follows immediately that for each $\mu$, the mean of $g(x;\mu,\lambda)$ is exactly 
$\mu$. 

It is sufficient to show that the $\lambda$-score vectors are linearly independent, which follows 
by direct calculation. 

The orthogonality result follows immediately from Morris [15], who shows that if 
$f(x;\mu)$ is a regular natural exponential family with $\mu$ the mean parametrization and 
variance function $V_f(\mu)$ a polynomial of degree at most 2, then the polynomials defined 
by 

$$P_k(x;\mu) := V_f^{k}(\mu)\,\frac{f^{(k)}(x;\mu)}{f(x;\mu)}$$

comprise an orthogonal system. □ 

Proof of Theorem 4. Consider first a one-dimensional affine subspace of $(X_{\mathrm{Mix}}, V_{\mathrm{Mix}}, +)$ 
which can be written as $f(x) + \lambda v(x)$, where $f(x) \in X_{\mathrm{Mix}}$, $v(x) \in V_{\mathrm{Mix}}$. The corresponding 
log-likelihood, defined on the convex subset of densities, is 

$$l(\lambda) = \sum_{i=1}^{n} \log\{f(x_i) + \lambda v(x_i)\}$$

and so 

$$\frac{d^2 l}{d\lambda^2} = -\sum_{i=1}^{n} \frac{v(x_i)^2}{(f(x_i) + \lambda v(x_i))^2},$$

hence it is concave. 

In general, consider any two points $f_1$, $f_2$ in the fiber at $\mu_0$ which are density functions. 
The convex combination of $f_1$ and $f_2$ is a one-dimensional affine space in the fiber, hence 
the corresponding log-likelihood is concave. It follows that 

$$l(\rho f_1 + (1-\rho) f_2) \ge \rho\, l(f_1) + (1-\rho)\, l(f_2)$$

for $0 < \rho < 1$. The log-concavity for the fibers of the local mixture model inside the hard 
boundary therefore follows immediately. □ 

Theorem 6. First, consider the following lemma. 

Lemma 1. If $Q \in \mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$ and $|M(\mu)| = O(\varepsilon^{1/2})$, then from (8), it follows that 

$$\int_{M(\mu)} f(x;m)\,dQ(m) - g(x;\mu,\Phi(Q)) = R(x,\mu,Q) = O(\varepsilon^{(r+1)/2}).$$

In particular, there exists a bound $\delta(x,\mu)$ on $R(x,\mu,Q)$ which is uniform for all mixing 
distributions in $\mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$. 

Proof. The remainder term $R(x,\mu,Q)$ can be expressed, using Taylor's theorem, as 
$M_{r+1} \times f^{(r+1)}(x,\mu^*)$ for some $\mu^* \in M(\mu)$. Since $M(\mu)$ is compact, there is a uniform 
bound for both $f^{(r+1)}$ and the $M_{r+1}$ term for all $Q \in \mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$. Thus the result follows 
immediately. □ 



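Proof of Theorem 6. By direct computation, it follows that the marginal posterior for 
$\mu$ over $\mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$, say $p(\mu)$, is given by 

$$p(\mu) = \int \prod_{i=1}^{n}\left\{\int_{M(\mu)} f(x_i;m)\,dQ(m)\right\} \pi(\mu,Q)\,dP(Q)$$

$$\qquad = \int \prod_{i=1}^{n}\{g(x_i;\mu,\Phi(Q)) + R(x_i,\mu,Q)\}\,\pi(\mu,Q)\,dP(Q)$$

$$\qquad = \int \prod_{i=1}^{n} g(x_i;\mu,\Phi(Q))\left\{1 + \frac{R(x_i,\mu,Q)}{g(x_i;\mu,\Phi(Q))}\right\} \pi(\mu,Q)\,dP(Q),$$

where each integral is over $\mathcal{Q}^{\mathrm{dis}}_{M(\mu)}$ and $\lambda = \Phi(Q)$. 

The assumptions of the theorem and the results of Lemma 1 give that 

$$\prod_{i=1}^{n}\left|1 + \frac{R(x_i,\mu,Q)}{g(x_i;\mu,\lambda)}\right| \le 1 + R_3(\varepsilon),$$

where the bound $R_3(\varepsilon)$ is uniform in $Q$ and of order $\varepsilon^{(r+1)/2}$. 

It thus follows that there exists a prior $\pi(\mu,\lambda) = \int_{\{Q \,:\, \Phi(Q) = \lambda\}} \pi(\mu,Q)\,dP(Q)$ for which 
$R_2(\varepsilon)$ bounds 

$$\left| \int \prod_{i=1}^{n}\left\{\int_{M(\mu)} f(x_i;m)\,dQ(m)\right\} \times \pi(\mu,Q)\,dP(Q) - \int_{\Lambda(M(\mu))} \prod_{i=1}^{n} g(x_i;\mu,\lambda) \times \pi(\mu,\lambda)\,dP(\lambda) \right|.$$

□ 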

Proof of Theorem 7. Here, the proof for $r = 4$ is shown, which generalizes to any 
$r$. The polynomials $x - \mu$, $p_2(x,\mu)$, $p_3(x,\mu)$, $p_4(x,\mu)$, defined in Example 1 from the 
derivatives of $f(x;\mu)$, are orthogonal (Morris [15]) and span the space of polynomials 
of degree less than or equal to 4. The remainder term $R(x,\mu,Q)$ defined in (8) can be 
expressed as a linear combination of terms $f^{(k)}(x;\mu)$ for $k > 4$. By the orthogonality of 
derivatives of the binomial probability mass function, these terms satisfy 

$$\int x^i f^{(k)}(x;\mu)\,dx = 0 \tag{13}$$

for $i = 1,\ldots,4$ and $k > 4$. It thus follows that the term $R(x,\mu,Q)$ does not affect the first 
four moments and hence, from (8), it is immediate that 

$$\mathcal{M}_4(g(x;\mu,\Phi(Q))) = \mathcal{M}_4\left(\int f(x;m)\,dQ(m)\right).$$

Since it is easy to show from (13) that the first four moments uniquely characterize local 
mixtures of binomials, the result follows immediately. □ 